Introduction

In the graduate classes I took in my last few terms at B-school (Machine Learning for Business Managers, Network Analytics, Fintech, Data Driven Customer Management), I was introduced to network analytics and the Google search algorithm in more detail. Google's organic search algorithm seemed like a perfect match for the network analytics concepts (PageRank, eigenvector centrality) and the basic NLP algorithms I learned in those classes. Hence, this post is about implementing a search algorithm using Python and libraries such as sklearn and networkx.

About

We will implement search over TED talks based on user queries, using data available on Kaggle, and show the top 5 results. We will use Pandas (a data manipulation library), Sklearn (a machine learning library), and Networkx (a Python library for studying graphs). Ideally, our search engine should give results similar to the TED website search (https://www.ted.com/search?q=technology+and+robots).

Conceptual Overview

To provide top search results, search engines like Google and Bing look at 3 distinct sources of information:

  1. Content Relevance: Vector Model - It looks at how relevant the content is to the search keywords. We will use an important metric, Term Frequency-Inverse Document Frequency (TFIDF), to find the important words in the transcripts of the TED talks. The more prominent the search query words are in a document, the higher its TFIDF score.
  2. Network Popularity: Citation Model - Famously known as PageRank (initially, Google search's competitive advantage over other search engines like Yahoo). Web search engines use a crawler to build a network of web pages, in which pages cited by other important pages get a higher PageRank. For our purpose, we create a network (directed graph) with TED talks as nodes and a directed edge from a source TED talk to each talk recommended alongside it. We use eigenvector centrality as the metric; PageRank is a variant of it. We use the Python Networkx library to build this.
  3. Behavioral Data - Metrics such as CTR (click-through rate) and time spent on a web page are also considered.

In this tutorial, we focus on the first 2, i.e. Content Relevance and Network Popularity. We have limited behavioral data, such as views and comments on TED videos. We will incorporate them in the next version.
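
To make the TFIDF idea concrete, here is a minimal pure-Python sketch on three made-up one-line "transcripts". It uses the textbook weighting tf * log(N / df). Sklearn's TfidfVectorizer, which this post uses later, applies a smoothed idf and normalises each row, so its numbers will differ. The documents and the helper name `tfidf_scores` are illustrative assumptions, not part of the TED dataset or of sklearn.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration; not taken from the TED dataset).
docs = [
    "robots will change technology",
    "schools kill creativity in schools",
    "technology in schools",
]

def tfidf_scores(documents):
    """Return one {term: tf-idf} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

scores = tfidf_scores(docs)
# "robots" occurs only in the first document, so it outweighs "technology",
# which the first document shares with the third.
print(scores[0]["robots"] > scores[0]["technology"])
```

A term that is frequent in one document but rare across the corpus gets the highest weight, which is the property the search step relies on.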

import networkx as nx
import pandas as pd
import os
import json
import ast
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from IPython.display import Image, HTML, display
pd.set_option('display.max_colwidth',1000)

Data preprocessing

  • We load the csv files and create a dataframe
  • As visible in the output below, the final dataframe contains the columns transcript, url, title, and related_talks

#collapse-show
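# Build the paths with os.path.join so that they work on any operating system.
path_to_data = os.path.join(os.getcwd(), "data")
ted_main_filepath = os.path.join(path_to_data, "ted_main.csv")
transcripts_filepath = os.path.join(path_to_data, "transcripts.csv")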

ted_main_df = pd.read_csv(ted_main_filepath)
ted_main_df = ted_main_df[['title', 'url', 'related_talks']]
transcripts_df = pd.read_csv(transcripts_filepath)

#merge the two dataframes to create one. 
final_ted_df = transcripts_df.merge(ted_main_df, on="url")

final_ted_df.head(1)
transcript url title related_talks
0 Good morning. How are you?(Laughter)It's been great, hasn't it? I've been blown away by the whole thing. In fact, I'm leaving.(Laughter)There have been three themes running through the conference which are relevant to what I want to talk about. One is the extraordinary evidence of human creativity in all of the presentations that we've had and in all of the people here. Just the variety of it and the range of it. The second is that it's put us in a place where we have no idea what's going to happen, in terms of the future. No idea how this may play out.I have an interest in education. Actually, what I find is everybody has an interest in education. Don't you? I find this very interesting. If you're at a dinner party, and you say you work in education — Actually, you're not often at dinner parties, frankly.(Laughter)If you work in education, you're not asked.(Laughter)And you're never asked back, curiously. That's strange to me. But if you are, and you say to somebody, you know, the... https://www.ted.com/talks/ken_robinson_says_schools_kill_creativity\n Do schools kill creativity? [{'id': 865, 'hero': 'https://pe.tedcdn.com/images/ted/172559_800x600.jpg', 'speaker': 'Ken Robinson', 'title': 'Bring on the learning revolution!', 'duration': 1008, 'slug': 'sir_ken_robinson_bring_on_the_revolution', 'viewed_count': 7266103}, {'id': 1738, 'hero': 'https://pe.tedcdn.com/images/ted/de98b161ad1434910ff4b56c89de71af04b8b873_1600x1200.jpg', 'speaker': 'Ken Robinson', 'title': "How to escape education's death valley", 'duration': 1151, 'slug': 'ken_robinson_how_to_escape_education_s_death_valley', 'viewed_count': 6657572}, {'id': 2276, 'hero': 'https://pe.tedcdn.com/images/ted/3821f3728e0b755c7b9aea2e69cc093eca41abe1_2880x1620.jpg', 'speaker': 'Linda Cliatt-Wayman', 'title': 'How to fix a broken school? 
Lead fearlessly, love hard', 'duration': 1027, 'slug': 'linda_cliatt_wayman_how_to_fix_a_broken_school_lead_fearlessly_love_hard', 'viewed_count': 1617101}, {'id': 892, 'hero': 'https://pe.tedcdn.com/images/ted/e79958940573cc610ccb583619a54866c41ef303_2880x1620.jpg', 's...
# The PageRank algorithm tries to find the most prominent web pages in a network of web pages.
# Essentially, the PageRank of a web page (a node in the network) depends on the ranks of the pages linking to it,
# which in turn depend on the ranks of the pages linking to them, and so on.
# A node with a higher PageRank is cited by other highly ranked nodes.

# In our case of vertical search over TED videos, we will create a directed graph with TED videos as nodes
# and a directed edge from each source TED video to every TED video related to it.
# Assumption: if a TED video appears in the recommendations of highly ranked TED videos, it must be highly ranked as well.

# To create the directed graph we will use the networkx library.
# First we need a dataframe of all edges (source TED video, recommended TED video).
recommendations_df = final_ted_df[["title","related_talks"]]
print(recommendations_df)
                                                          title  \
0                                   Do schools kill creativity?   
1                                   Averting the climate crisis   
2                                              Simplicity sells   
3                                           Greening the ghetto   
4                               The best stats you've ever seen   
...                                                         ...   
2462         What we're missing in the debate about immigration   
2463                            The most Martian place on Earth   
2464  What intelligent machines can learn from a school of fish   
2465               A black man goes undercover in the alt-right   
2466         How a video game might help us build better cities   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                related_talks  
0     [{'id': 865, 'hero': 'https://pe.tedcdn.com/images/ted/172559_800x600.jpg', 'speaker': 'Ken Robinson', 'title': 'Bring on the learning revolution!', 'duration': 1008, 'slug': 'sir_ken_robinson_bring_on_the_revolution', 'viewed_count': 7266103}, {'id': 1738, 'hero': 'https://pe.tedcdn.com/images/ted/de98b161ad1434910ff4b56c89de71af04b8b873_1600x1200.jpg', 'speaker': 'Ken Robinson', 'title': "How to escape education's death valley", 'duration': 1151, 'slug': 'ken_robinson_how_to_escape_education_s_death_valley', 'viewed_count': 6657572}, {'id': 2276, 'hero': 'https://pe.tedcdn.com/images/ted/3821f3728e0b755c7b9aea2e69cc093eca41abe1_2880x1620.jpg', 'speaker': 'Linda Cliatt-Wayman', 'title': 'How to fix a broken school? Lead fearlessly, love hard', 'duration': 1027, 'slug': 'linda_cliatt_wayman_how_to_fix_a_broken_school_lead_fearlessly_love_hard', 'viewed_count': 1617101}, {'id': 892, 'hero': 'https://pe.tedcdn.com/images/ted/e79958940573cc610ccb583619a54866c41ef303_2880x1620.jpg', 's...  
1     [{'id': 243, 'hero': 'https://pe.tedcdn.com/images/ted/566c14767bd62c5ff760e483c5b16cd2753328cd_2880x1620.jpg', 'speaker': 'Al Gore', 'title': 'New thinking on the climate crisis', 'duration': 1674, 'slug': 'al_gore_s_new_thinking_on_the_climate_crisis', 'viewed_count': 1751408}, {'id': 547, 'hero': 'https://pe.tedcdn.com/images/ted/89288_800x600.jpg', 'speaker': 'Ray Anderson', 'title': 'The business logic of sustainability', 'duration': 954, 'slug': 'ray_anderson_on_the_business_logic_of_sustainability', 'viewed_count': 881833}, {'id': 2093, 'hero': 'https://pe.tedcdn.com/images/ted/146d88845861cbf768bbf8bec8b2e41f8bfc7903_2400x1800.jpg', 'speaker': 'Lord Nicholas Stern', 'title': 'The state of the climate — and what we might do about it', 'duration': 993, 'slug': 'lord_nicholas_stern_the_state_of_the_climate_and_what_we_might_do_about_it', 'viewed_count': 773779}, {'id': 2784, 'hero': 'https://pe.tedcdn.com/images/ted/e835e670a7836cf65aca2a7a644fd94398cb4b8e_2880x1620.jpg', 'spe...  
2     [{'id': 1725, 'hero': 'https://pe.tedcdn.com/images/ted/b7f415a054cc0a2bfdd90d0ad5a7f64cf060150d_1600x1200.jpg', 'speaker': 'David Pogue', 'title': '10 top time-saving tech tips', 'duration': 344, 'slug': 'david_pogue_10_top_time_saving_tech_tips', 'viewed_count': 4843421}, {'id': 2274, 'hero': 'https://pe.tedcdn.com/images/ted/608e677e4392bcdcf82b068fa221b9df74a213ef_2880x1620.jpg', 'speaker': 'Tony Fadell', 'title': 'The first secret of design is ... noticing', 'duration': 1001, 'slug': 'tony_fadell_the_first_secret_of_design_is_noticing', 'viewed_count': 2005916}, {'id': 172, 'hero': 'https://pe.tedcdn.com/images/ted/b790be2f87ceffba73fe73837944400c7d61cba2_1600x1200.jpg', 'speaker': 'John Maeda', 'title': 'Designing for simplicity', 'duration': 959, 'slug': 'john_maeda_on_the_simple_life', 'viewed_count': 1215942}, {'id': 2664, 'hero': 'https://pe.tedcdn.com/images/ted/092f184f6625c2aeef10949c8d7b2aa14ba4132b_2880x1620.jpg', 'speaker': 'Dan Bricklin', 'title': 'Meet the invento...  
3     [{'id': 1041, 'hero': 'https://pe.tedcdn.com/images/ted/96c703bb13a2e9c2d351a5e6b52390bc35eaad06_800x600.jpg', 'speaker': 'Majora Carter', 'title': '3 stories of local eco-entrepreneurship', 'duration': 1079, 'slug': 'majora_carter_3_stories_of_local_ecoactivism', 'viewed_count': 702642}, {'id': 1892, 'hero': 'https://pe.tedcdn.com/images/ted/f5ebbf91eb093a2da2cfe1941724a3e55d222713_1600x1200.jpg', 'speaker': 'Toni Griffin', 'title': 'A new vision for rebuilding Detroit', 'duration': 708, 'slug': 'toni_griffin_a_new_vision_for_rebuilding_detroit', 'viewed_count': 826727}, {'id': 2078, 'hero': 'https://pe.tedcdn.com/images/ted/92ddb109a1f98a3745fe1b2b0d2c5519ab3931dc_2400x1800.jpg', 'speaker': 'Dan Barasch', 'title': 'A park underneath the hustle and bustle of New York City', 'duration': 377, 'slug': 'dan_barasch_a_park_underneath_the_hustle_and_bustle_of_new_york_city', 'viewed_count': 862197}, {'id': 2873, 'hero': 'https://pe.tedcdn.com/images/ted/635f76d5d4d0a454b6131b93b471d1f65...  
4     [{'id': 2056, 'hero': 'https://pe.tedcdn.com/images/ted/afc9b259845ec1b543419871e10753d4d9044fda_2400x1800.jpg', 'speaker': 'Talithia Williams', 'title': "Own your body's data", 'duration': 1023, 'slug': 'talithia_williams_own_your_body_s_data', 'viewed_count': 1345319}, {'id': 2296, 'hero': 'https://pe.tedcdn.com/images/ted/9e003b0a822daba702608136e73f7be001b6b2f2_2880x1620.jpg', 'speaker': 'Manuel Lima', 'title': 'A visual history of human knowledge', 'duration': 769, 'slug': 'manuel_lima_a_visual_history_of_human_knowledge', 'viewed_count': 1710818}, {'id': 620, 'hero': 'https://pe.tedcdn.com/images/ted/111517_800x600.jpg', 'speaker': 'Hans Rosling', 'title': 'Let my dataset change your mindset', 'duration': 1196, 'slug': 'hans_rosling_at_state', 'viewed_count': 1471015}, {'id': 974, 'hero': 'https://pe.tedcdn.com/images/ted/205077_800x600.jpg', 'speaker': 'Hans Rosling', 'title': "The good news of the decade? We're winning the war against child mortality", 'duration': 934, 'slu...  
...                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ...  
2462  [{'id': 2596, 'hero': 'https://pe.tedcdn.com/images/ted/00702bc1ae05ccf5f79b44ec84955517bbf030fa_2880x1620.jpg', 'speaker': 'Sayu Bhojwani', 'title': 'Immigrant voices make democracy stronger', 'duration': 762, 'slug': 'sayu_bhojwani_how_immigrant_voices_make_democracy_stronger', 'viewed_count': 783703}, {'id': 2813, 'hero': 'https://pe.tedcdn.com/images/ted/73702d7b558bf8dbcd910ecaaf3aa323c9125bf7_2880x1620.jpg', 'speaker': 'Jorge Ramos', 'title': 'Why journalists have an obligation to challenge power', 'duration': 870, 'slug': 'jorge_ramos_why_journalists_have_an_obligation_to_challenge_power', 'viewed_count': 383092}, {'id': 1368, 'hero': 'https://pe.tedcdn.com/images/ted/d7ec12559193290dff988c6a79ace15db5abc471_1600x1200.jpg', 'speaker': 'Tan Le', 'title': 'My immigration story', 'duration': 736, 'slug': 'tan_le_my_immigration_story', 'viewed_count': 1072475}, {'id': 2038, 'hero': 'https://pe.tedcdn.com/images/ted/0c67ab6739fbb74fdd2cde3b0ea98741c2926d3f_2400x1800.jpg', 'speake...  
2463  [{'id': 2491, 'hero': 'https://pe.tedcdn.com/images/ted/a8dbff8cfccb989af849c15247d98d81283915c8_2880x1620.jpg', 'speaker': 'Carrie Nugent', 'title': 'Adventures of an asteroid hunter', 'duration': 366, 'slug': 'carrie_nugent_adventures_of_an_asteroid_hunter', 'viewed_count': 976519}, {'id': 2656, 'hero': 'https://pe.tedcdn.com/images/ted/a21cadeb1b1d90c40e51509d92f9837b16e0290d_2880x1620.jpg', 'speaker': 'Anjali Tripathi', 'title': 'Why Earth may someday look like Mars', 'duration': 715, 'slug': 'anjali_tripathi_why_earth_may_someday_look_like_mars', 'viewed_count': 975612}, {'id': 2677, 'hero': 'https://pe.tedcdn.com/images/ted/62d8533e4c1069b858f2fe52a24bef69dcf07aa7_2880x1620.jpg', 'speaker': 'Nagin Cox', 'title': 'What time is it on Mars?', 'duration': 827, 'slug': 'nagin_cox_what_time_is_it_on_mars', 'viewed_count': 1374122}, {'id': 421, 'hero': 'https://pe.tedcdn.com/images/ted/321b76428b3b65c63d4eaec56dadbc54dc40362f_1600x1200.jpg', 'speaker': 'Penelope Boston', 'title': 'T...  
2464  [{'id': 2346, 'hero': 'https://pe.tedcdn.com/images/ted/87f82eddb91d2c806ebb11465d846c47f481902c_2880x1620.jpg', 'speaker': 'Vijay Kumar', 'title': 'The future of flying robots', 'duration': 789, 'slug': 'vijay_kumar_the_future_of_flying_robots', 'viewed_count': 1349369}, {'id': 2825, 'hero': 'https://pe.tedcdn.com/images/ted/0cab5e995cfea069cbc5fc24f04d69ab951eefce_2880x1620.jpg', 'speaker': 'Marc Raibert', 'title': 'Meet Spot, the robot dog that can run, hop and open doors', 'duration': 873, 'slug': 'marc_raibert_meet_spot_the_robot_dog_that_can_run_hop_and_open_doors', 'viewed_count': 1475849}, {'id': 2852, 'hero': 'https://pe.tedcdn.com/images/ted/dec243629fa6b3bdde879a0d2bc8d831ac894904_2880x1620.jpg', 'speaker': 'Noriko Arai', 'title': 'Can a robot pass a university entrance exam?', 'duration': 817, 'slug': 'noriko_arai_can_a_robot_pass_a_university_entrance_exam', 'viewed_count': 773331}, {'id': 1376, 'hero': 'https://pe.tedcdn.com/images/ted/8aa84e7e5d405e75f19fc51bf6f99183...  
2465  [{'id': 2512, 'hero': 'https://pe.tedcdn.com/images/ted/6094382cb03573581a6b2e3f6c7f0ce1adf7173c_2880x1620.jpg', 'speaker': 'Joseph Ravenell', 'title': 'How barbershops can keep men healthy', 'duration': 788, 'slug': 'joseph_ravenell_how_barbershops_can_keep_men_healthy', 'viewed_count': 993599}, {'id': 1378, 'hero': 'https://pe.tedcdn.com/images/ted/537e4f8ab618be6cf3d40287aa04df004f543c2f_1600x1200.jpg', 'speaker': 'Bryan Stevenson', 'title': 'We need to talk about an injustice', 'duration': 1421, 'slug': 'bryan_stevenson_we_need_to_talk_about_an_injustice', 'viewed_count': 3792347}, {'id': 2837, 'hero': 'https://pe.tedcdn.com/images/ted/1db17412c95870c4a467c0e3c6ea9fe28336f87e_2880x1620.jpg', 'speaker': 'Damon Davis', 'title': 'Courage is contagious', 'duration': 325, 'slug': 'damon_davis_what_i_saw_at_the_ferguson_protests', 'viewed_count': 721768}, {'id': 2802, 'hero': 'https://pe.tedcdn.com/images/ted/0fc3715f6144e8d835e4ad63057656bccbf8e67f_2880x1620.jpg', 'speaker': 'Trista...  
2466  [{'id': 2682, 'hero': 'https://pe.tedcdn.com/images/ted/5344a548b578587ac392c3e05e0e604f55371d94_2880x1620.jpg', 'speaker': 'Jeff Speck', 'title': '4 ways to make a city more walkable', 'duration': 1117, 'slug': 'jeff_speck_4_ways_to_make_a_city_more_walkable', 'viewed_count': 1354747}, {'id': 2839, 'hero': 'https://pe.tedcdn.com/images/ted/7651fdc16fac4fe5a41e91a65ee168af109e227e_2880x1620.jpg', 'speaker': 'Peter Calthorpe', 'title': '7 principles for building better cities', 'duration': 860, 'slug': 'peter_calthorpe_7_principles_for_building_better_cities', 'viewed_count': 834219}, {'id': 1501, 'hero': 'https://pe.tedcdn.com/images/ted/1524765b73f465b35cdf9c4689674a42bdd5a917_1600x1200.jpg', 'speaker': 'Jane McGonigal', 'title': 'The game that can give you 10 extra years of life', 'duration': 1170, 'slug': 'jane_mcgonigal_the_game_that_can_give_you_10_extra_years_of_life', 'viewed_count': 6141800}, {'id': 1429, 'hero': 'https://pe.tedcdn.com/images/ted/f45ba92c04bdaedad7cf56e5182...  

[2467 rows x 2 columns]
def recommended_titles_list(reco_str):
    # related_talks is stored as the string form of a Python list of dicts,
    # so ast.literal_eval can parse it without executing any code.
    related_talks = ast.literal_eval(reco_str)
    return [talk['title'] for talk in related_talks]
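
As a quick check of the parsing step, the sketch below runs the same idea on a shortened sample in the shape of a related_talks cell (most keys omitted). The helper name `titles_from_related` is hypothetical. Note that `json.loads` alone would reject this string, because its keys and values use single quotes, which is why `ast.literal_eval` is used.

```python
import ast

# Shortened sample in the shape of a related_talks cell (most keys omitted).
sample = "[{'id': 865, 'title': 'Bring on the learning revolution!'}, {'id': 1738, 'title': \"How to escape education's death valley\"}]"

def titles_from_related(reco_str):
    """Parse the string form of a list of dicts and return the titles."""
    return [talk["title"] for talk in ast.literal_eval(reco_str)]

titles = titles_from_related(sample)
print(titles)
```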
# Take each row and write one (title, related_title) edge per recommended talk.
# Rows are collected in a list first: DataFrame.append was removed in pandas 2.0,
# and appending row by row copies the whole dataframe on every call.
edge_rows = []
for index, row in recommendations_df.iterrows():
    title = row['title']
    reco_list = recommended_titles_list(row['related_talks'])
    for reco_title in reco_list:
        edge_rows.append({'title': title, 'related_title': reco_title})
edges_df = pd.DataFrame(edge_rows, columns=['title', 'related_title'])

print(edges_df.head(5))
# There are 14802 (title, related_title) rows, i.e. candidate directed edges.
print(edges_df.shape)
                         title  \
0  Do schools kill creativity?   
1  Do schools kill creativity?   
2  Do schools kill creativity?   
3  Do schools kill creativity?   
4  Do schools kill creativity?   

                                            related_title  
0                       Bring on the learning revolution!  
1                  How to escape education's death valley  
2  How to fix a broken school? Lead fearlessly, love hard  
3                       Education innovation in the slums  
4                      A short intro to the Studio School  
(14802, 2)
# Create the directed graph from the edges dataframe using networkx.
# Duplicate (title, related_title) rows collapse into a single edge in a DiGraph,
# so the graph can have fewer edges than edges_df has rows.
di_reco_graph = nx.from_pandas_edgelist(edges_df,'title','related_title', create_using=nx.DiGraph())
# Print generic info about the directed graph.
# Note: nx.info was removed in networkx 3.0; on newer versions use print(di_reco_graph) instead.
print(nx.info(di_reco_graph))

# PageRank is a variant of eigenvector centrality. Hence we compute the eigenvector centrality of each node (TED video).
eigenvector_dict = nx.eigenvector_centrality(di_reco_graph)

# Normalize the centrality values so that they sum to 1 (each value then lies between 0 and 1).
factor=1.0/sum(eigenvector_dict.values())
normalised_eigenvector_dict = {k: v*factor for k, v in eigenvector_dict.items() }
#print(normalised_eigenvector_dict)
#print({k: v for k, v in sorted(normalised_eigenvector_dict.items(), key=lambda item: item[1], reverse = True)})

# Add the eigenvector centrality values to the final_ted_df dataframe.
eigenvectors_df = pd.DataFrame(list(normalised_eigenvector_dict.items()), columns=['title', 'eigenvector_value'])
final_ted_df = final_ted_df.merge(eigenvectors_df,on="title")
print(final_ted_df.head(1))
Name: 
Type: DiGraph
Number of nodes: 2520
Number of edges: 14784
Average in degree:   5.8667
Average out degree:   5.8667
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                transcript  \
0  Good morning. How are you?(Laughter)It's been great, hasn't it? I've been blown away by the whole thing. In fact, I'm leaving.(Laughter)There have been three themes running through the conference which are relevant to what I want to talk about. One is the extraordinary evidence of human creativity in all of the presentations that we've had and in all of the people here. Just the variety of it and the range of it. The second is that it's put us in a place where we have no idea what's going to happen, in terms of the future. No idea how this may play out.I have an interest in education. Actually, what I find is everybody has an interest in education. Don't you? I find this very interesting. If you're at a dinner party, and you say you work in education — Actually, you're not often at dinner parties, frankly.(Laughter)If you work in education, you're not asked.(Laughter)And you're never asked back, curiously. That's strange to me. But if you are, and you say to somebody, you know, the...   

                                                                     url  \
0  https://www.ted.com/talks/ken_robinson_says_schools_kill_creativity\n   

                         title  \
0  Do schools kill creativity?   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             related_talks  \
0  [{'id': 865, 'hero': 'https://pe.tedcdn.com/images/ted/172559_800x600.jpg', 'speaker': 'Ken Robinson', 'title': 'Bring on the learning revolution!', 'duration': 1008, 'slug': 'sir_ken_robinson_bring_on_the_revolution', 'viewed_count': 7266103}, {'id': 1738, 'hero': 'https://pe.tedcdn.com/images/ted/de98b161ad1434910ff4b56c89de71af04b8b873_1600x1200.jpg', 'speaker': 'Ken Robinson', 'title': "How to escape education's death valley", 'duration': 1151, 'slug': 'ken_robinson_how_to_escape_education_s_death_valley', 'viewed_count': 6657572}, {'id': 2276, 'hero': 'https://pe.tedcdn.com/images/ted/3821f3728e0b755c7b9aea2e69cc093eca41abe1_2880x1620.jpg', 'speaker': 'Linda Cliatt-Wayman', 'title': 'How to fix a broken school? Lead fearlessly, love hard', 'duration': 1027, 'slug': 'linda_cliatt_wayman_how_to_fix_a_broken_school_lead_fearlessly_love_hard', 'viewed_count': 1617101}, {'id': 892, 'hero': 'https://pe.tedcdn.com/images/ted/e79958940573cc610ccb583619a54866c41ef303_2880x1620.jpg', 's...   

   eigenvector_value  
0           0.003404  
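
Since the post uses eigenvector centrality as a stand-in for PageRank, it may help to see what PageRank itself adds: a damping factor that lets rank "teleport" to any node, which keeps the iteration well behaved on graphs with dead ends. The following is a minimal pure-Python power-iteration sketch on a hypothetical four-talk graph. The function name, the toy edges, and the damping value of 0.85 are illustrative assumptions. Networkx also ships its own `nx.pagerank`, which would be the practical choice on the real graph.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node first receives the "teleport" share.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for source in nodes:
            targets = out_links[source]
            if targets:
                # Split the damped rank of the source over its out-links.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for n in nodes:
                    new_rank[n] += damping * rank[source] / len(nodes)
        rank = new_rank
    return rank

# Hypothetical toy graph: talks A, B and C all recommend D; D recommends A.
toy_edges = [("A", "D"), ("B", "D"), ("C", "D"), ("D", "A")]
ranks = pagerank(toy_edges)
print(max(ranks, key=ranks.get))
```

In this toy graph D collects recommendations from three talks and ends up with the highest rank, while A ranks second because it is recommended by the highly ranked D.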
#TODO: Insert graphs from gephi, and modularity analysis
edges_df.to_csv('graph_edges.csv')
# Let's take a detour into network analytics using a WYSIWYG tool called Gephi.
# We can create a directed graph from the exported spreadsheet and analyse eigenvector centrality, degree, PageRank, and modularity.
# Modularity is....

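# os.path.join with a trailing "" keeps the final separator, so the concatenations below still work on any OS.
image_folder_path = os.path.join(os.getcwd(), "img", "")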
# Following is the picture of directed graph
Image(filename = image_folder_path + "overall_graph.png", width=250, height=250)
#Following is the picture of biggest modularity class (subgroup) and the associated data.
x = Image(filename = image_folder_path + "technology&innovation_module.png", width=250, height=250)
y = Image(filename = image_folder_path + "technology&innovation_module_data.png", width=800, height=800)
display(x,y)
#Following is the picture of second biggest modularity class (subgroup) and the associated data.
x = Image(filename = image_folder_path + "art_design_arch.png", width=250, height=250)
y = Image(filename = image_folder_path + "art_design_arch_data.png", width=800, height=800)
display(x,y)
# We have the transcripts of all the talks, so we can extract TFIDF keywords from them.
# We create a TFIDF matrix of transcript terms for all talks, using the TFIDF vectorizer of SKLearn.
tfidf_vector = TfidfVectorizer(stop_words='english')
tfidf_values = tfidf_vector.fit_transform(final_ted_df['transcript'])
tfidf_matrix = tfidf_values.toarray()
print(tfidf_matrix.shape)
# It has found 58489 features (columns) for 2467 TED videos (rows).

# Show 50 of the 58489 identified features.
# Note: get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement.
print(tfidf_vector.get_feature_names_out()[5000:5050].tolist())
# The vectorizer identifies a lot of features (terms) from the transcripts.
# Some numbers have also been identified as features, which could be avoided by preprocessing the data.
# It will be done in the next version.
(2467, 58489)
['baldness', 'baldwin', 'baldy', 'bale', 'baleen', 'baleful', 'balenciaga', 'balers', 'bales', 'balfour', 'bali', 'balikpapan', 'balinese', 'balk', 'balkan', 'balkans', 'balked', 'balkh', 'balkhi', 'ball', 'ballads', 'ballah', 'ballard', 'ballast', 'ballasted', 'ballbot', 'ballbots', 'balled', 'ballerina', 'ballet', 'balletic', 'ballets', 'ballgame', 'ballistic', 'ballistically', 'ballistics', 'ballmer', 'balloon', 'ballooning', 'balloonist', 'balloons', 'ballot', 'ballots', 'ballpark', 'ballplayer', 'ballpoint', 'ballroom', 'ballrooms', 'balls', 'ballsy']
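
On the point about numbers showing up as features: one possible preprocessing step (a sketch, not what this post implements) is a token pattern that only accepts runs of letters. The regular expression below is demonstrated with the standard `re` module. TfidfVectorizer has a `token_pattern` parameter that takes such a pattern, but whether this exact pattern suits the TED transcripts is an untested assumption. The names `LETTERS_ONLY` and `tokenize` are hypothetical.

```python
import re

# Letters-only pattern: [^\W\d_] matches a word character that is
# neither a digit nor an underscore; {2,} requires at least two of them.
LETTERS_ONLY = r"(?u)\b[^\W\d_]{2,}\b"

def tokenize(text):
    """Lowercase the text and keep only tokens made purely of letters."""
    return re.findall(LETTERS_ONLY, text.lower())

tokens = tokenize("In 2010 we built 3 robots and 12 ballbots.")
print(tokens)
```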
# Now we are done with eigenvector centrality and TFIDF.
# For the search query entered, we compute a matching score for each TED video by summing the TFIDF values of every word in the query.
# Since we intend to show the top 5 results, we will show the 5 TED videos with the highest scores.

#search_query = "schools"
#search_query = "technology and robots"
search_query = "inspiration and courage"
# Matching score for all the TED videos.
# Get the search tokens in the search query (lowercased, because the vectorizer lowercases its vocabulary).
search_tokens = search_query.lower().split()

# Look up the column index of each search token in the vocabulary learned by the TFIDF vectorizer.
# Tokens missing from the vocabulary (e.g. the stop word "and") are skipped.
vocabulary = tfidf_vector.vocabulary_
token_indexes = [vocabulary[token] for token in search_tokens if token in vocabulary]

# Start from zero for every talk so that matching_scores is defined even when nothing matches.
matching_scores = np.zeros(tfidf_matrix.shape[0])
if len(token_indexes) == 0:
    # No search term found in the vocabulary. Return no results.
    print("No results")
else:
    print(token_indexes)
    for index in token_indexes:
        matching_scores = np.add(matching_scores, tfidf_matrix[:,index])

print(matching_scores)
print(matching_scores.shape)
[26788, 12444]
[0.         0.         0.         ... 0.01912304 0.         0.        ]
(2467,)
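
The scoring step above boils down to "for each talk, add up the TFIDF weights of the query words". A minimal sketch of that logic without numpy, on made-up weights, looks like this. The rows, weights, and the helper name `matching_scores_for` are hypothetical and only meant to show the shape of the computation.

```python
# Hypothetical TFIDF rows: one {term: weight} dict per talk (values made up).
tfidf_rows = [
    {"courage": 0.30, "protest": 0.20},
    {"inspiration": 0.25, "disability": 0.40},
    {"robots": 0.50},
]

def matching_scores_for(query, rows):
    """Sum the TFIDF weights of the query tokens for every row."""
    tokens = query.lower().split()
    # Terms a talk does not contain (or stop words such as "and") add 0.
    return [sum(row.get(token, 0.0) for token in tokens) for row in rows]

scores = matching_scores_for("inspiration and courage", tfidf_rows)
print(scores)
```

Talks that contain none of the query words keep a score of zero, which matches the mostly zero array printed above.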
# Create a dataframe with title, url, matching scores, and eigenvector centrality values
search_dataframe = final_ted_df[['title','url', 'eigenvector_value']]
search_dataframe = search_dataframe.assign(matching_scores=matching_scores)
# Total score: equal-weight blend of content relevance (TFIDF) and network popularity (eigenvector centrality)
search_dataframe['total_score'] = 0.5 * search_dataframe['matching_scores'] + 0.5 * search_dataframe['eigenvector_value']
search_dataframe.sort_values(['total_score'], ascending=[False], inplace = True)
# Show the top 5 search results
search_dataframe.head(5)[['title','url']]
title url
2426 Courage is contagious https://www.ted.com/talks/damon_davis_what_i_saw_at_the_ferguson_protests\n
1681 I'm not your inspiration, thank you very much https://www.ted.com/talks/stella_young_i_m_not_your_inspiration_thank_you_very_much\n
2239 It's time for women to run for office https://www.ted.com/talks/halla_tomasdottir_it_s_time_for_women_to_run_for_office\n
2196 A new way to heal hearts without surgery https://www.ted.com/talks/franz_freudenthal_a_new_way_to_heal_hearts_without_surgery\n
800 The power of vulnerability https://www.ted.com/talks/brene_brown_on_vulnerability\n
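
One thing to keep in mind about the 0.5 / 0.5 blend: the two scores live on different scales. The eigenvector values were normalised to sum to 1 across roughly 2,500 talks, so a typical value is tiny, while the matching scores are sums of TFIDF weights. If the two signals should contribute comparably, one option (an assumption, not what this post does) is to rescale each to the range [0, 1] before blending. The sketch below shows this on made-up numbers. The helper `min_max` and all values are hypothetical.

```python
def min_max(values):
    """Rescale a list of numbers to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # All values equal: no ranking information, so return zeros.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# Made-up scores for three talks.
matching = [0.02, 0.00, 0.01]
centrality = [0.0004, 0.0034, 0.0001]

# Equal-weight blend of the two rescaled signals.
total = [0.5 * m + 0.5 * c for m, c in zip(min_max(matching), min_max(centrality))]
best = max(range(len(total)), key=total.__getitem__)
print(best)
```

Other choices, such as different weights or ranking by relevance first and using popularity only to break ties, would be equally reasonable starting points.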